In [1]:
%autosave 10


Autosaving every 10 seconds

Why

  • As a preprocessing step for another technique
  • For exploring data

Why not use scikit-learn

  • When things don't work, it's useful to dig in, and that requires knowing how the algorithm works
  • Do use the library, but sometimes do things by hand

K-Means

  • The objective is not convex: the algorithm is guaranteed to converge, but possibly to a local minimum; it is not guaranteed to find the global minimum.
    • Run it multiple times with randomised starting centroids and pick the best result (see the sketch below).
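
A minimal sketch of the restart strategy, using scikit-learn's KMeans on made-up toy data (n_init controls the number of random restarts; the run with the lowest inertia is kept):

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.RandomState(0)
    X = rng.randn(300, 2)  # made-up toy data; replace with your own

    # n_init=10 runs k-means 10 times from random starting centroids and
    # keeps the run with the lowest inertia (sum of squared errors).
    km = KMeans(n_clusters=3, init="random", n_init=10, random_state=0)
    labels = km.fit_predict(X)
    print(km.inertia_)  # objective value of the best of the 10 runs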

When does K-Means break down, and when not

  • If features (dimensions) have different scales
    • K-Means assumes the data are well described by centroids and that all dimensions share the same scale.
    • !!AI so what kind of feature scaling to use?
  • Need to scale features first (see the scaling sketch after this list)
  • K-Means is resistant to irrelevant dimensions (ones carrying no information)
    • but if the irrelevant dimension(s) have large variance, it fails (!!AI variance, or different scale? both?)
  • If a dimension isn't Gaussian but uniform random (noise), it also breaks down.
  • Lessons:
    • Be careful when adding new features / dimensions.
    • "Curse of dimensionality": with too many dimensions, the very notion of similarity breaks down, and Euclidean distance becomes less useful (see the distance-concentration sketch after this list).
  • If the clusters are non-isotropic (not linearly separable), i.e. there is no clear centroid, it will fail too.
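
As flagged in the list above, a minimal sketch of scaling features before k-means, assuming StandardScaler (zero mean, unit variance per feature) is an acceptable choice; the two-feature toy data is made up:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    rng = np.random.RandomState(0)
    # Two features on wildly different scales: the second would dominate
    # the Euclidean distance if left unscaled.
    X = np.column_stack([rng.randn(200), 1000.0 * rng.randn(200)])

    # StandardScaler rescales each feature to zero mean and unit variance,
    # so no single dimension dominates the distance calculation.
    model = make_pipeline(StandardScaler(),
                          KMeans(n_clusters=2, n_init=10, random_state=0))
    labels = model.fit_predict(X)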
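
And a minimal sketch of the distance-concentration effect behind the "curse of dimensionality", on made-up uniform random data: as the dimension grows, the relative gap between the nearest and farthest point shrinks, so Euclidean similarity becomes less informative.

    import numpy as np

    rng = np.random.RandomState(0)
    # Relative spread of distances from one point to all others, by dimension.
    for d in (2, 10, 100, 1000):
        X = rng.rand(500, d)
        dists = np.linalg.norm(X[0] - X[1:], axis=1)
        print(d, (dists.max() - dists.min()) / dists.min())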

Choosing between results

  • Given two sets of clustering results, what is the error?
  • Inertia: the sum of squared distances between each point and its nearest cluster centre.
    • On its own this isn't useful: adding more clusters always decreases it, and it doesn't penalise the increased complexity of the solution (cf. regularisation).
    • Use it to compare two clustering results given an a priori assumption about the number of clusters.
  • Silhouette coefficient: measures the compactness of a point's own cluster relative to its distance from points outside that cluster.
    • Use it to choose the optimum number of clusters (see the sketch below).
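
A minimal sketch of choosing k with the silhouette coefficient, using scikit-learn's silhouette_score on made-up blob data: inertia always decreases as k grows, while the silhouette should peak near the true number of clusters.

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs
    from sklearn.metrics import silhouette_score

    X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

    # Inertia falls monotonically with k; the silhouette coefficient does
    # not, so it can be used to pick k.
    for k in range(2, 8):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        print(k, km.inertia_, silhouette_score(X, km.labels_))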

Visualising errors

!!AI see presenter notes.

Spectral clustering

  • Can capture clusters that are not linearly separable (non-isotropic); see the sketch after this list.
  • Can estimate the number of clusters (e.g. via the eigengap heuristic).
  • Not restricted to Euclidean distance as the distance metric
    • Pass in a "similarity matrix" instead
  • Good for text.
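
A minimal sketch of spectral clustering on data with no clear centroids, using scikit-learn's SpectralClustering on the two-moons toy dataset (k-means fails here):

    from sklearn.cluster import SpectralClustering
    from sklearn.datasets import make_moons

    # Two interleaved half-circles: not linearly separable, no centroids.
    X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

    # affinity="nearest_neighbors" builds a sparse similarity graph from
    # the data; a precomputed similarity matrix can be passed instead with
    # affinity="precomputed".
    sc = SpectralClustering(n_clusters=2, affinity="nearest_neighbors",
                            n_neighbors=10, random_state=0)
    labels = sc.fit_predict(X)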

!!AI see notes. Requires solid linear algebra knowledge (maybe add references).

  • Spectral clustering looks at the eigenvectors with the smallest eigenvalues, whereas PCA looks at those with the largest; it solves a different problem.

  • Unnormalised spectral clustering tries to balance the size (number of points) of the clusters
    • Relaxation of the RatioCut graph-partitioning objective

  • Normalised spectral clustering tries to balance the degree (density) of the clusters
    • Relaxation of the normalised cut (NCut) objective
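
A minimal sketch of the eigenvector step, assuming the unnormalised graph Laplacian L = D - W built from a k-nearest-neighbours graph: the eigenvectors for the smallest eigenvalues become the new coordinates, and ordinary k-means runs on them.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_moons
    from sklearn.neighbors import kneighbors_graph

    X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

    # Symmetric k-NN adjacency matrix W, then unnormalised Laplacian L = D - W.
    W = kneighbors_graph(X, n_neighbors=10, mode="connectivity").toarray()
    W = np.maximum(W, W.T)
    L = np.diag(W.sum(axis=1)) - W

    # Eigenvectors for the *smallest* eigenvalues (PCA would take the
    # largest); use the first two as the embedded coordinates.
    eigvals, eigvecs = np.linalg.eigh(L)  # eigenvalues in ascending order
    embedding = eigvecs[:, :2]

    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embedding)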

Questions

  • Time complexity?
    • Not sure analytically, but a recommendation problem with 400,000 data points on users and movies took 30 minutes on a laptop.
    • You first spend O(n^2) calculating all pairwise distances, then sparsify using a threshold. Couldn't you sparsify first, and only then calculate distances?
      • Yes, probably (see the sketch below).
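
A minimal sketch of the sparsify-first idea, assuming a k-nearest-neighbours graph so that only O(n·k) similarities are ever stored, rather than a dense O(n^2) distance matrix that is thresholded afterwards:

    from sklearn.cluster import SpectralClustering
    from sklearn.datasets import make_moons
    from sklearn.neighbors import kneighbors_graph

    X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

    # Build a sparse k-NN graph directly (tree-based neighbour search),
    # instead of computing all pairwise distances and then thresholding.
    A = kneighbors_graph(X, n_neighbors=10, include_self=False)
    A = 0.5 * (A + A.T)  # symmetrise the adjacency matrix

    sc = SpectralClustering(n_clusters=2, affinity="precomputed",
                            random_state=0)
    labels = sc.fit_predict(A)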

In [ ]: